Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues
Authors
Abstract
We address the problem of evaluating textual, task-oriented dialogues between a customer and a helpdesk, such as those that take the form of online chats. As an initial step towards evaluating automatic helpdesk dialogue systems, we have constructed a test collection comprising 3,700 real Customer-Helpdesk multi-turn dialogues by mining Weibo, a major Chinese social media platform. We have annotated each dialogue with multiple subjective quality annotations and nugget annotations, where a nugget is a minimal sequence of posts by the same utterer that helps towards problem solving. In addition, 10% of the dialogues have been manually translated into English. We have made our test collection, DCH-1, publicly available for research purposes. We also propose a simple nugget-based evaluation measure for task-oriented dialogue evaluation, which we call UCH, and explore its usefulness and limitations.
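The abstract defines a nugget only informally and does not give the UCH formula, so the following minimal Python sketch is purely illustrative: the Post, Nugget, and Dialogue field names and the toy coverage score are assumptions made to show how nugget annotations over customer and helpdesk posts could be represented and aggregated; they are not the authors' actual measure.

```python
# Illustrative sketch only. The field names and the scoring rule are assumptions
# for illustration; UCH as defined in the paper is not reproduced here.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Post:
    utterer: str   # "customer" or "helpdesk"
    text: str

@dataclass
class Nugget:
    post_indices: List[int]   # indices of consecutive posts by the same utterer

@dataclass
class Dialogue:
    posts: List[Post]
    nuggets: List[Nugget] = field(default_factory=list)
    quality_ratings: List[int] = field(default_factory=list)  # subjective annotations

def nugget_coverage(dialogue: Dialogue) -> float:
    """Toy nugget-based score: fraction of posts covered by at least one nugget.
    This is NOT the UCH measure from the paper; it only shows how post-level
    nugget annotations could be turned into a dialogue-level score."""
    if not dialogue.posts:
        return 0.0
    covered = {i for nugget in dialogue.nuggets for i in nugget.post_indices}
    return len(covered) / len(dialogue.posts)

# Example usage on a tiny made-up dialogue
dialogue = Dialogue(
    posts=[
        Post("customer", "My phone cannot connect to Wi-Fi."),
        Post("helpdesk", "Please try restarting the router."),
        Post("customer", "That fixed it, thanks!"),
    ],
    nuggets=[Nugget([0]), Nugget([1])],
    quality_ratings=[2, 1, 2],
)
print(nugget_coverage(dialogue))  # 0.666...
```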
Similar Resources
Multi-domain case-based module for customer support
Technology Management Centres provide technological and customer support services for private or public organisations. Commonly, these centres offer support using helpdesk software that facilitates the work of their operators. In this paper, a CBR module that acts as a solution recommender for customer support environments is presented. The CBR module is flexible and multi-domain, in order to...
Full Text
Dynamic active probing of helpdesk databases
Helpdesk databases are used to store past interactions between customers and companies to improve customer service quality. One common scenario of using a helpdesk database is to find whether recommendations exist given a new problem from a customer. However, customers often provide incomplete or even inaccurate information. Manually preparing a list of clarification questions does not work for l...
Full Text
Decision Trees for Helpdesk Advisor Graphs
We use decision trees to build a helpdesk agent reference network to facilitate the on-the-job advising of junior or less experienced staff on how to better address telecommunication customer fault reports. Such reports generate field measurements and remote measurements which, when coupled with location data and client attributes, and fused with organization-level statistics, can produce model...
Full Text
Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently Subjective Annotations
ABSTRACT This paper proposes a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2)...
Full Text
Evaluating Spoken Language Systems
Spoken language systems (SLSs) for accessing information sources or services through the telephone network and the Internet are currently being trialed and deployed for a variety of tasks. Evaluating the usability of different interface designs requires a method for comparing the performance of different versions of the SLS. Recently, Walker et al. (1997) proposed PARADISE (PARAdigm for DIalogue Sys...
Full Text